Data visualization is an important (and fun) part of data science. However, real data is rarely tidy and clean (except in textbook examples), so we need to be able to clean and transform our data sets so they better suit our needs. Often you’ll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations. The tidyverse provides several tools that make our life easier while cleaning and transforming our data; the Swiss Army knife for this job is dplyr.
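In this part of the course we will use the dplyr package to manipulate the data. It provides a set of functions (verbs) that help us solve the most common data manipulation challenges, like picking observations by their values (filter()), reordering the rows (arrange()), picking variables by their names (select()), creating new variables from existing ones (mutate()) and collapsing many values into a single summary (summarise()).
All verbs work similarly: the first argument is a data frame, the following arguments describe what to do with it using the column names, and the result is a new data frame.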
For this exercise we’ll use a new data set, flights, which describes the flights that departed from New York City in 2013. It ships with the nycflights13 package, so first we need to install that package.
install.packages("nycflights13")
## Installing package into '/home/plablo/R/x86_64-pc-linux-gnu-library/3.4'
## (as 'lib' is unspecified)
Then we tell R that we want to use it, along with the tidyverse suite.
library(nycflights13)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.8.0 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Now we can take a peek at our data:
nycflights13::flights
As you can see, this data set is large (336,776 rows), so it would be nice to use some sort of filtering to get acquainted with the data.
The filter() verb allows us to select a subset of the rows of our data based on a condition. The first argument is the name of the data frame, followed by the expressions used to filter the data; a row is kept only if it satisfies all of them. For instance, suppose we want to get the flights departing on January 1st. We can filter our data like this:
filter(nycflights13::flights, month == 1, day == 1)
When you run that line of code, dplyr executes the filtering operation and returns a new data frame. To save the result, you’ll need to use the assignment operator (<-):
(jan1 <- filter(nycflights13::flights, month == 1, day == 1))
Now we have a variable called jan1 that stores our filtered data. Notice the parentheses around the assignment expression: they tell the R interpreter that, besides doing the assignment, we want to see the result printed out.
Of course, we can use logical operators such as & (and), | (or), and ! (not) in the filtering expression:
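The printing trick is plain R and not specific to dplyr. A minimal sketch with made-up vectors (x and y are illustrative names, not part of the flights data):

```r
# Plain assignment: the value is stored, nothing is printed
x <- c(1, 2, 3)

# Assignment wrapped in parentheses: the value is stored AND printed
(y <- x * 2)
#> [1] 2 4 6
```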
filter(nycflights13::flights, month == 11 | month == 12)
Another useful operator is %in%. The expression x %in% y selects every row where x is one of the values in y. We could use it to rewrite the code above:
(nov_dec <- filter(nycflights13::flights, month %in% c(11, 12)))
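To see how these operators combine, here is a small sketch on a made-up data frame (toy and its values are invented for illustration only; the column names simply mimic the ones in flights):

```r
library(dplyr)

# Made-up data, only for illustration
toy <- data.frame(
  month     = c(1, 6, 11, 12),
  dep_delay = c(5, 130, 0, 45)
)

# Either condition may hold (|)
either <- filter(toy, month == 11 | month == 12)

# The same rows, written with %in%
same <- filter(toy, month %in% c(11, 12))

# Both conditions must hold (&), one of them negated (!):
# November or December flights that did NOT leave more than 30 minutes late
on_time_winter <- filter(toy, month %in% c(11, 12) & !(dep_delay > 30))

identical(either$month, same$month)
#> [1] TRUE
```

Here on_time_winter keeps only the November row, because the December flight left 45 minutes late.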
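Another example of filtering, this time keeping only the flights that were not delayed by more than two hours on either arrival or departure: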
(filter(nycflights13::flights, arr_delay <= 120, dep_delay <= 120))
1.- Filter all the flights with an arrival delay of more than two hours
2.- Filter all the delayed flights in autumn
Sometimes we want to have our data sorted in a particular way so that we can better inspect it. For this, dplyr provides the arrange() function, which lets us change the ordering of the rows. The arrange() function takes a data frame and a set of column names and orders the rows according to the values of the columns provided (of course, the column values must have a natural order). If you provide more than one column name, each additional column will be used to break ties in the values of the preceding columns. Wrapping a column in desc() sorts it in descending order:
(arrange(nycflights13::flights, desc(arr_delay)))
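The tie-breaking rule can be sketched on a tiny made-up data frame (toy and its values are invented for illustration):

```r
library(dplyr)

# Made-up data, only for illustration
toy <- data.frame(
  carrier   = c("B", "A", "B", "A"),
  arr_delay = c(10, 30, 50, 5)
)

# Sort by carrier first; within each carrier, largest delay first
sorted <- arrange(toy, carrier, desc(arr_delay))
sorted$carrier    # A, A, B, B
sorted$arr_delay  # 30, 5, 50, 10
```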
When we have a data set with several features (columns), we will sometimes want to work with only a subset of them. This is when the select() verb comes in handy. It lets you pick columns by name, so you can get a data frame with only the variables relevant for a specific task:
(select(nycflights13::flights, year, month, day))
Since data frames are ordered (both in rows and in columns), you can select a range of columns. With the following code you are selecting all the columns between year and sched_dep_time.
# Select all columns between year and sched_dep_time (inclusive)
select(nycflights13::flights, year:sched_dep_time)
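A range selection depends only on the order of the columns in the data frame. A minimal sketch on a one-row, made-up data frame (toy is invented for illustration):

```r
library(dplyr)

# Made-up one-row data frame; the column order matters for ranges
toy <- data.frame(
  year = 2013, month = 1, day = 1,
  dep_time = 517, dep_delay = 2
)

# Every column from year up to and including day
first_three <- select(toy, year:day)
names(first_three)  # year, month, day
```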
Make a multiple selection using a vector of columns as input.
Hint: you can define a vector in R like this: v <- c(1, 2, 3, 4)
Another common task in data analysis is the creation of new variables based on the values of existing ones. The mutate() function lets you add new columns to the end of the data frame that are functions of existing ones.
For this example we will create a narrower (fewer variables) data set called flights_sml that contains a subset of the original variables and add two new columns: gain and speed.
flights_sml <- select(nycflights13::flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)
(flights_sml <- mutate(flights_sml,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
))
Notice how we created a variable named flights_sml to hold our subsetted data and then appended the two new columns to it (effectively reassigning the variable). We also used a new helper function while subsetting our data: ends_with(). It takes a string as its argument and selects every column whose name ends with that string.
One useful feature of mutate() is that you can refer to columns you have just created within the same call:
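A small sketch of both ideas on a made-up, one-row data frame (toy is invented for illustration). It assumes, as in flights, that distance is measured in miles and air_time in minutes, so multiplying by 60 gives a speed in miles per hour:

```r
library(dplyr)

# Made-up data: 600 miles flown in 120 minutes
toy <- data.frame(
  dep_delay = 5, arr_delay = 20,
  distance = 600, air_time = 120
)

# ends_with("delay") matches both delay columns
delay_cols <- select(toy, ends_with("delay"))
names(delay_cols)  # dep_delay, arr_delay

# miles / minutes * 60 = miles per hour
with_speed <- mutate(toy, speed = distance / air_time * 60)
with_speed$speed
#> [1] 300
```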
mutate(flights_sml,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
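To check the arithmetic by hand, the same call can be run on two made-up flights (toy and its values are invented for illustration):

```r
library(dplyr)

# Made-up data: two flights
toy <- data.frame(
  arr_delay = c(20, -10),
  dep_delay = c(5, 0),
  air_time  = c(120, 60)
)

res <- mutate(toy,
  gain = arr_delay - dep_delay,  # 15 and -10
  hours = air_time / 60,         # 2 and 1
  gain_per_hour = gain / hours   # uses the two columns created just above
)
res$gain_per_hour  # 7.5 and -10
```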
The final dplyr verb is summarise(). It lets us aggregate the values in our data frame according to some expression and returns a summary of our data. The following code creates the variable delay, which holds the mean of dep_delay computed without the missing values (NA):
summarise(nycflights13::flights, delay = mean(dep_delay, na.rm = TRUE))
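The na.rm argument matters because a single missing value turns the whole mean into NA. A minimal sketch with made-up numbers (delays and toy are invented for illustration):

```r
library(dplyr)

# Made-up delays with one missing value
delays <- c(10, NA, 20)

mean(delays)
#> [1] NA
mean(delays, na.rm = TRUE)
#> [1] 15

# The same idea inside summarise()
toy <- data.frame(dep_delay = delays)
res <- summarise(toy, delay = mean(dep_delay, na.rm = TRUE))
res$delay
#> [1] 15
```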
Summaries become more useful when coupled with a grouping operation, because then we can produce summaries (aggregations) for specific groups of data. For example, let's suppose we want to know the average delay for each month of the year. We can group our observations by month and then use the mean to summarise each group:
by_month <- group_by(nycflights13::flights, month)
(monthly_avg <- summarise(by_month, delay = mean(dep_delay, na.rm = TRUE)))
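What grouping does to a summary is easier to see on a handful of rows. A sketch on made-up data (toy, toy_by_month and toy_summary are names invented for illustration):

```r
library(dplyr)

# Made-up data: two flights in month 1, two in month 2 (one delay missing)
toy <- data.frame(
  month     = c(1, 1, 2, 2),
  dep_delay = c(10, 20, NA, 40)
)

toy_by_month <- group_by(toy, month)

# One output row per group instead of one row overall
toy_summary <- summarise(toy_by_month,
  delay   = mean(dep_delay, na.rm = TRUE),  # 15 for month 1, 40 for month 2
  flights = n()                             # n() counts rows, NA included
)
toy_summary
```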
You can use any aggregate function when summarising data. Add the standard deviation of the departure delay to the monthly_avg data frame.
Of course, we can visualize our transformed data to get an idea of how the month influences the average delay time:
ggplot(data = monthly_avg) +
  geom_point(mapping = aes(x = month, y = delay))
Let's try to answer a more complicated question. For example, let's find out whether flight distance has any influence on the delay time.
The first thing we need to do is group our data by destination and calculate, for each group, the number of flights, the mean distance and the mean arrival delay. Then we will keep only the groups with more than 20 observations (to ensure we have enough data in each group) and leave out Honolulu (HNL), which is much farther away than any other destination. Finally, we will plot the resulting data to get an idea of the relationship between distance and delays.
by_dest <- group_by(nycflights13::flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
It looks like delays increase with distance up to ~750 miles and then decrease. Maybe as flights get longer there’s more ability to make up delays in the air?
Look into this hypothesis, perhaps by looking at the average speed.
Looking at the last chunk of code, we can see that in order to obtain our result we needed an intermediate variable (by_dest) and had to assign delay twice. This makes the code cumbersome and difficult to read. The magrittr package (part of the tidyverse) provides us with a more concise way to chain operations: the pipe, %>%.
delay <- group_by(nycflights13::flights, dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'
The pipe operator passes the output of one operation as the first argument of the next one, so the delay variable holds the result of chaining the group_by(), summarise() and filter() operations. Although not strictly necessary, pipes make the code easier to read and produce less clutter in our variable space.
Think of a hypothesis you want to test with the data and use a combination of verbs to test it out. Of course, your analysis should end up with a visualization!
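Put differently, x %>% f(y) is just another way of writing f(x, y). The sketch below, on made-up data (toy, nested and piped are names invented for illustration), shows that the nested and the piped versions give the same result:

```r
library(dplyr)

# Made-up data, only for illustration
toy <- data.frame(
  dest      = c("A", "A", "B"),
  arr_delay = c(10, 30, 5)
)

# Nested calls: read from the inside out
nested <- summarise(group_by(toy, dest), delay = mean(arr_delay))

# Piped calls: read from top to bottom
piped <- toy %>%
  group_by(dest) %>%
  summarise(delay = mean(arr_delay))

identical(nested$delay, piped$delay)
#> [1] TRUE
```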